Maximum likelihood: extracting unbiased information from complex networks

The choice of free parameters in network models is subjective, since it depends on what topological
properties are being monitored. However, we show that the Maximum Likelihood (ML) principle
indicates a unique, statistically rigorous parameter choice, associated with a well-defined topological
feature. We then find that, if the ML condition is incompatible with the built-in parameter choice,
network models turn out to be intrinsically ill-defined or biased. To overcome this problem, we
construct a class of safely unbiased models. We also propose an extension of these results that leads
to the fascinating possibility of extracting, from topological data alone, the 'hidden variables' underlying
network organization, making them 'no more hidden'. We test our method on the World Trade Web
data, where we recover the empirical Gross Domestic Product using only topological information.

In complex network theory, graph models are systematically
used either as null hypotheses against which real-world
networks are analysed, or as testbeds for the validation
of network formation mechanisms. Until now
there has been no rigorous scheme to define network models.
However, here we use the Maximum Likelihood (ML)
principle to show that undesired statistical biases naturally
arise in graph models, which in most cases turn out
to be ill-defined. We then show that the ML approach
constructively indicates a correct definition of unbiased
models. Remarkably, it also allows one to extract hidden
information from real networks, with intriguing consequences
for the understanding of network formation.

The framework that we introduce here allows us to solve
three related, increasingly complicated problems. First,
we discuss the correct choice of free parameters. Model 
parameters are fixed in such a way that the expected 
values (i.e. ensemble averages over many realizations) 
of some 'reference' topological property match the em- 
pirically observed ones. But since there are virtually as 
many properties as we want to monitor in a network, and 
surely many more than the number of model parameters, 
it is important to ask if the choice of the reference prop- 
erties is arbitrary or if a rigorous criterion exists. We 
find that the ML method provides us with a unique, sta- 
tistically correct parameter choice. Second, we note that 
the above ML choice may conflict with the structure of 
the model itself, if the latter is defined in such a way 
that the expected value of some property, which is not 
the correct one, matches the corresponding empirical one. 
We find that the ML method identifies such intrinsically 
ill-defined models, and can also be used to define safe, 
unbiased ones. The third, and perhaps most fascinating, 
aspect regards the extraction of information from a real 
network. Many models are defined in terms of additional
'hidden variables' associated with vertices. The
ultimate aim of these models is to identify the hidden
variables with empirically observable quantities, so that
the model will provide a mechanism of network formation
driven by these quantities. While for a few networks
this identification has been carried out successfully,
in most cases the hidden variables are assigned ad hoc.
However, since in this case the hidden variables play essentially
the role of free parameters, one is led again to
the original problem: if a non-arbitrary parameter choice
exists, we can infer the hidden variables from real data.
As a profound and exciting consequence, the quantities
underlying network organization are 'no more hidden'.

In order to illustrate how the ML method solves this
three-fold problem successfully, we use equilibrium graph
ensembles as an example. All network models depend on
a set of parameters that we collectively denote by the
vector $\theta$. Let $P(G|\theta)$ be the conditional probability of
occurrence of a graph $G$ in the ensemble spanned by the
model. For a given topological property $\pi(G)$ displayed
by a graph $G$, the expected value $\langle\pi\rangle_\theta$ reads

$$\langle\pi\rangle_\theta = \sum_G \pi(G)\,P(G|\theta) \qquad (1)$$



In order to reproduce a real-world network $A$, one usually
chooses some reference properties $\{\pi_i\}_i$ and then sets $\theta$
to the 'matching value' $\theta_M$ such that

$$\langle\pi_i\rangle_{\theta_M} = \pi_i(A) \quad \forall i \qquad (2)$$



Our first problem is: is this method statistically rigorous?
And what properties have to be chosen anyway? A simple
example is when a real undirected network $A$ with $N$ vertices
and $L$ undirected links is compared with a random
graph where the only parameter is the connection probability
$\theta = p$. The common choice for $p$ is such that the
expected number of links $\langle L\rangle_p = pN(N-1)/2$ equals the
empirical value $L$, which yields $p_M = 2L/[N(N-1)]$. But
one could alternatively choose $p$ in such a way that the
expected value $\langle C\rangle$ of the clustering coefficient matches
the empirical value $C$, resulting in the different choice
$p_M = C$. Similarly, one could choose any other reference
property $\pi$, and end up with different values of $p$. Therefore,
in principle the optimal choice of $p$ is undetermined,
due to the arbitrariness of the reference property.

However, we now show that the ML approach indicates
a unique, statistically correct parameter choice. Consider
a random variable $v$ whose probability distribution
$f(v|\theta)$ depends on a parameter $\theta$. For a physically realized
outcome $v = v'$, $f(v'|\theta)$ represents the likelihood
that $v'$ is generated by the parameter choice $\theta$. Therefore,
for fixed $v'$, the optimal choice for $\theta$ is the value
$\theta^*$ maximizing $f(v'|\theta)$ or equivalently $\lambda(\theta) = \log f(v'|\theta)$.
The ML approach avoids the drawbacks of other fitting 
methods, such as the subjective choice of fitting curves 
and of the region where the fit is performed. This is par- 
ticularly important for networks, often characterized by 
broad distributions that may look like power laws with 
a certain exponent (subject to statistical error) in some 
region, but that may be more closely reproduced by an- 
other exponent or even by different curves as the fitting 
region is changed. By contrast, the ML approach always 
yields a unique and rigorous parameter value. Examples 
of recent applications of the ML principle to networks can 
be found in [9]. In our problem, the log-likelihood that a
real network $A$ is generated by the parameter choice $\theta$ is

$$\lambda(\theta) = \log P(A|\theta) \qquad (3)$$

and the ML condition for the optimal choice $\theta^*$ is

$$\nabla\lambda(\theta^*) = \left[\frac{\partial\lambda(\theta)}{\partial\theta}\right]_{\theta=\theta^*} = 0 \qquad (4)$$

This gives a unique solution to our first problem. For
instance, in the random graph model we have

$$P(A|p) = p^{L}\,(1-p)^{N(N-1)/2 - L} \qquad (5)$$

Writing the likelihood function $\lambda(p) = \log P(A|p)$ and
looking for the ML value $p^*$ such that $\lambda'(p^*) = 0$ yields

$$p^* = \frac{2L}{N(N-1)} \qquad (6)$$



Therefore we find that the ML value for $p$ is the one we
obtain by requiring $\langle L\rangle = L$. In general, different reference
quantities (for instance the clustering coefficient)
would not yield the statistically correct ML value.
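
As a quick numerical illustration of eqs. (5) and (6), the following minimal sketch evaluates the log-likelihood of a random graph on a grid of candidate values of $p$ and compares the grid maximizer with the closed form $2L/[N(N-1)]$. The values of `N` and `L`, the grid resolution and the function names are hypothetical choices made only for this illustration.

```python
import math


def log_likelihood(p, n_vertices, n_links):
    """lambda(p) = log P(A|p) for the random graph of eq. (5)."""
    n_pairs = n_vertices * (n_vertices - 1) // 2
    return n_links * math.log(p) + (n_pairs - n_links) * math.log(1.0 - p)


def ml_connection_probability(n_vertices, n_links):
    """Closed-form maximizer p* = 2L / [N(N-1)] of eq. (6)."""
    return 2.0 * n_links / (n_vertices * (n_vertices - 1))


# Hypothetical toy network: N = 10 vertices and L = 12 links.
N, L = 10, 12
p_star = ml_connection_probability(N, L)

# Brute-force check: maximize lambda(p) over a grid of candidate values.
grid = [i / 1000.0 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, N, L))

print(p_star, p_grid)
```

The grid maximizer should agree with $p^*$ up to the grid spacing, which is the sense in which matching the number of links is the ML choice for this model.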

For the random graph model the above correct choice 
is also the most frequently used. However, more com- 
plicated models may be intrinsically ill-defined, as there 
may be no possibility to match expected and observed 
values of the desired reference properties without violat- 
ing the ML condition. This is the second problem we 
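anticipated. To illustrate it, it is enough to consider a
slightly more general class of models, obtained when the
links between all pairs of vertices $i, j$ are drawn with different
and independent probabilities $p_{ij}(\theta)$.
Now

$$P(A|\theta) = \prod_{i<j} p_{ij}(\theta)^{a_{ij}}\,\left[1 - p_{ij}(\theta)\right]^{1-a_{ij}} \qquad (7)$$

where the product runs over vertex pairs $(i,j)$, and
$a_{ij} = 1$ if $i$ and $j$ are connected in graph $A$, and $a_{ij} = 0$
otherwise. Then eq. (3) becomes

$$\lambda(\theta) = \sum_{i<j} a_{ij}\,\log\frac{p_{ij}(\theta)}{1 - p_{ij}(\theta)} + \sum_{i<j} \log\left[1 - p_{ij}(\theta)\right] \qquad (8)$$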
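
The step from eq. (7) to eq. (8) can be checked numerically. The sketch below computes the log of eq. (7) pair by pair and compares it with the two sums of eq. (8). The toy graph, the values in `x` and `z`, and the particular form chosen for `prob` are assumptions made only for the illustration; any probabilities strictly between 0 and 1 would do.

```python
import math
from itertools import combinations


def log_likelihood_direct(edges, prob, n):
    """Log of eq. (7): a_ij log p_ij + (1 - a_ij) log(1 - p_ij), summed over pairs."""
    total = 0.0
    for i, j in combinations(range(n), 2):
        p = prob(i, j)
        total += math.log(p) if (i, j) in edges else math.log(1.0 - p)
    return total


def log_likelihood_eq8(edges, prob, n):
    """Eq. (8): log-odds summed over links plus log(1 - p_ij) summed over all pairs."""
    odds_term = sum(math.log(prob(i, j) / (1.0 - prob(i, j))) for i, j in edges)
    base_term = sum(math.log(1.0 - prob(i, j)) for i, j in combinations(range(n), 2))
    return odds_term + base_term


# Hypothetical toy data: 4 vertices with illustrative vertex values x_i.
x = [0.5, 1.0, 2.0, 4.0]
z = 0.7


def prob(i, j):
    """Illustrative connection probability, strictly between 0 and 1."""
    w = z * x[i] * x[j]
    return w / (1.0 + w)


edges = {(0, 3), (1, 2), (1, 3), (2, 3)}  # pairs stored with i < j
lam_direct = log_likelihood_direct(edges, prob, len(x))
lam_eq8 = log_likelihood_eq8(edges, prob, len(x))
print(lam_direct, lam_eq8)
```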



For instance, in hidden variable models $p_{ij}$ is
a function of a control parameter $\theta = z$ and of some
quantities $x_i$, $x_j$ that we assume fixed for the moment.
As a first example, consider the popular bilinear choice

$$p_{ij}(z) = z\,x_i x_j \qquad (9)$$

Writing $\lambda(z) = \log P(A|z)$ as in eq. (8) and differentiating yields

$$\lambda'(z^*) = \sum_{i<j}\left[\frac{a_{ij}}{z^*} - \frac{(1-a_{ij})\,x_i x_j}{1 - z^* x_i x_j}\right] = 0 \qquad (10)$$



Since $\sum_{i<j} a_{ij} = L$, the condition for $z^*$ becomes

$$L = \sum_{i<j} (1-a_{ij})\,\frac{z^* x_i x_j}{1 - z^* x_i x_j} \qquad (11)$$



This shows that if we set $z = z^*$, then $L$ is in general different
from the expected value $\langle L\rangle_{z^*} = \sum_{i<j} p_{ij}(z^*) =
\sum_{i<j} z^* x_i x_j$. This means that if we want the ML condition
to be fulfilled, we cannot tune the expected number
of links to the real one! Vice versa, if we want the
expected number of links to match the empirical one,
we have to set $z$ to a value different from the statistically
correct one, $z^*$. The problem is particularly evident
since, setting $x_i = \langle k_i\rangle$ and $z = (2\langle L\rangle)^{-1}$, eq. (9) can be rewritten
as $p_{ij} = \langle k_i\rangle\langle k_j\rangle/(2\langle L\rangle)$. So, in order to reproduce
a network with $L$ links we should paradoxically set the
built-in parameter $\langle L\rangle = (2z)^{-1}$ to a ML value which
is different from $L$. In analogy with the related problem
of biased estimators in statistics, we shall call biased
any model where the use of eq. (2) to match
expected and observed properties violates the ML condition.
As a second example, consider the model



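$$p_{ij}(z) = \frac{z\,x_i x_j}{1 + z\,x_i x_j} \qquad (12)$$

Writing $\lambda(z)$ and setting $\lambda'(z^*) = 0$ now yields

$$L = \sum_{i<j} \frac{z^* x_i x_j}{1 + z^* x_i x_j} \qquad (13)$$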



which now coincides with $\langle L\rangle = \sum_{i<j} p_{ij}(z^*)$, showing
that this model is unbiased: the ML condition (4) and the
requirement $\langle L\rangle = L$ are equivalent. In a previous paper
[6], we showed that this model reproduces the properties
of the World Trade Web (WTW) once $x_i$ is set equal
to the Gross Domestic Product (GDP) of the country
represented by vertex $i$. The parameter $z$ was chosen as
in eq. (13), and now we find that this is the correct
criterion. We shall again consider the WTW later on.
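
The contrast between the two examples can be made concrete with a small numerical sketch. It solves the ML condition for $z^*$ in the bilinear model, eq. (11), and in the model of eq. (12), eq. (13), and then compares the expected number of links with the observed one. The toy graph, the values of `x`, the helper names and the use of bisection are assumptions made for this illustration only. The example is chosen so that the pair with the largest product $x_i x_j$ is not linked, which guarantees that eq. (11) has a root in the range where all $p_{ij} < 1$.

```python
from itertools import combinations


def bisect_decreasing(func, lo, hi, n_iter=200):
    """Root of a decreasing function with func(lo) > 0 > func(hi), by bisection."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ml_z_bilinear(products, linked):
    """Solve eq. (11) for z* when p_ij = z x_i x_j."""
    n_links = sum(linked)

    def score(z):  # L minus the right-hand side of eq. (11)
        return n_links - sum(z * c / (1.0 - z * c)
                             for c, a in zip(products, linked) if not a)

    z_max = 1.0 / max(products)  # keeps every p_ij below 1
    return bisect_decreasing(score, 1e-12 * z_max, (1.0 - 1e-12) * z_max)


def ml_z_fermi(products, linked):
    """Solve eq. (13) for z* when p_ij = z x_i x_j / (1 + z x_i x_j)."""
    n_links = sum(linked)

    def score(z):  # L minus the right-hand side of eq. (13)
        return n_links - sum(z * c / (1.0 + z * c) for c in products)

    hi = 1.0
    while score(hi) > 0.0:  # enlarge the bracket until the sign changes
        hi *= 2.0
    return bisect_decreasing(score, 0.0, hi)


# Hypothetical toy data: 4 vertices, 3 links, illustrative values x_i.
x = [1.0, 2.0, 3.0, 4.0]
edges = {(0, 3), (1, 2), (1, 3)}
pair_list = list(combinations(range(len(x)), 2))
products = [x[i] * x[j] for i, j in pair_list]
linked = [pair in edges for pair in pair_list]
L = len(edges)

z_bilinear = ml_z_bilinear(products, linked)
z_fermi = ml_z_fermi(products, linked)
expected_bilinear = sum(z_bilinear * c for c in products)
expected_fermi = sum(z_fermi * c / (1.0 + z_fermi * c) for c in products)
print(L, expected_bilinear, expected_fermi)
```

In this toy case the model of eq. (12) returns an expected number of links equal to $L$ at $z^*$, while the bilinear model does not, which is the bias discussed above.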

The above examples show that while some models are 
unbiased, others are 'prohibited' by the ML principle. 
The problem of bias potentially underlies all network 
models, and is therefore of great importance. Is there 
a way to identify the class of safe, unbiased models? We
now show that one large class of unbiased models can
be constructively defined, namely the exponential random
graphs traditionally used by sociologists and
more recently considered by physicists [11, 14, 15, 16]. If
$\{\pi_i\}_i$ is a set of topological properties, an exponential
model is defined by the probability

$$P(G|\theta) = \frac{e^{-H(G|\theta)}}{Z(\theta)} \qquad (14)$$

where $H(G|\theta) = \sum_i \theta_i \pi_i(G)$ is the graph Hamiltonian
and $Z(\theta) = \sum_G \exp[-H(G|\theta)]$ is the partition function.
In the standard approach, one chooses
the matching value $\theta_M$ fitting the properties of a real
network. In order to check whether this violates the ML
principle, we need to look for the value $\theta^*$ maximizing
the likelihood to obtain a network described by a given
set $\{\pi_i\}_i$ of reference properties. The likelihood function
we have defined reads $\lambda(\theta) = \log P(A|\theta) = -H(A|\theta) -
\log Z(\theta)$ and eq. (4) gives for $\theta^*$

$$\left[\frac{\partial\lambda(\theta)}{\partial\theta_i}\right]_{\theta=\theta^*} = \left[-\pi_i(A) - \frac{1}{Z(\theta)}\frac{\partial Z(\theta)}{\partial\theta_i}\right]_{\theta=\theta^*} = 0 \qquad (15)$$

whose solution yields the ML condition

$$\pi_i(A) = \sum_G \pi_i(G)\,\frac{e^{-H(G|\theta^*)}}{Z(\theta^*)} = \langle\pi_i\rangle_{\theta^*} \quad \forall i \qquad (16)$$

which is equivalent to eq. (2): remarkably, $\theta^* = \theta_M$ and
the model is unbiased. We have thus proved that
any model of the form in eq. (14) is unbiased
under the ML principle, if and only if all the properties
$\{\pi_i\}_i$ included in $H$ are simultaneously chosen as the
reference ones used to tune the parameters $\theta$. The statistically
correct values $\theta^*$ of the latter are the solution of
the system of (in general coupled) equations (16). There
are as many such equations as the number of free parameters.
This gives us the following recipe: if we are
defining a model whose predictions will be matched to a
set of properties $\{\pi_i(A)\}_i$ observed in a real-world network
$A$, we should decide from the beginning what these
reference properties are, include them in $H(G|\theta)$ and define
$P(G|\theta)$ as in eq. (14). In this way we are sure to
obtain an unbiased model. The random graph is a trivial
special case where $\pi(A) = L$ and $H(G|\theta) = \theta L$ with
$p = (1 + e^{\theta})^{-1}$, and this is the reason why it is unbiased,
if $L$ is chosen as reference. The hidden-variable
model defined by eq. (12) is another special case where
$\pi_i(A) = k_i$ and $H(G|\theta) = \sum_i \theta_i k_i$ with $x_i = e^{-\theta_i}$ [11],
and so it is unbiased too. By contrast, eq. (9) cannot be
traced back to eq. (14), and the model is biased. Once
the general procedure is set out, one can look for other
special cases. The field of research on exponential random
graphs is currently very active [15, 16, 17],
and models including correlations and higher-order properties
are being studied, for instance to explore graphs
with nontrivial reciprocity [13] and clustering [18]. For
each of these models, our result (16) directly yields the
unbiased parameter choice in terms of the associated reference
properties.
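
The identity behind eqs. (15) and (16) can be verified by brute force on a very small ensemble. The sketch below enumerates all 64 graphs on 4 vertices, uses the number of links and the number of triangles as the two properties in $H$, and checks numerically that the gradient of $\lambda(\theta)$ equals $\langle\pi_i\rangle_\theta - \pi_i(A)$, so that it vanishes exactly when expected and observed values match. The choice of properties, the observed graph and the value of `theta` are assumptions made for the illustration.

```python
import math
from itertools import combinations, product

N = 4
PAIRS = list(combinations(range(N), 2))
TRIPLES = list(combinations(range(N), 3))


def properties(edges):
    """Return (number of links, number of triangles) of a graph given as a set of pairs."""
    n_triangles = sum(
        1 for a, b, c in TRIPLES
        if (a, b) in edges and (a, c) in edges and (b, c) in edges
    )
    return (len(edges), n_triangles)


# Every graph on N labelled vertices, as a frozenset of pairs (i, j) with i < j.
ALL_GRAPHS = [
    frozenset(pair for pair, present in zip(PAIRS, bits) if present)
    for bits in product((0, 1), repeat=len(PAIRS))
]


def hamiltonian(edges, theta):
    """H(G|theta) = sum_i theta_i pi_i(G)."""
    return sum(t * pi for t, pi in zip(theta, properties(edges)))


def log_partition(theta):
    return math.log(sum(math.exp(-hamiltonian(g, theta)) for g in ALL_GRAPHS))


def expected_properties(theta):
    """Ensemble averages <pi_i>_theta under eq. (14)."""
    log_z = log_partition(theta)
    totals = [0.0, 0.0]
    for g in ALL_GRAPHS:
        weight = math.exp(-hamiltonian(g, theta) - log_z)
        for i, pi in enumerate(properties(g)):
            totals[i] += weight * pi
    return totals


def log_likelihood(edges, theta):
    """lambda(theta) = -H(A|theta) - log Z(theta)."""
    return -hamiltonian(edges, theta) - log_partition(theta)


def numerical_gradient(edges, theta, step=1e-6):
    """Central finite differences of lambda with respect to each theta_i."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += step
        down[i] -= step
        grad.append((log_likelihood(edges, up) - log_likelihood(edges, down)) / (2 * step))
    return grad


# Hypothetical observed graph (one triangle plus a pendant link) and parameters.
observed = frozenset({(0, 1), (0, 2), (1, 2), (2, 3)})
theta = (0.3, -0.5)
gradient = numerical_gradient(observed, theta)
mismatch = [e - o for e, o in zip(expected_properties(theta), properties(observed))]
print(gradient, mismatch)
```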

We can now address the third problem. In the cases 
considered so far we assumed that the values of the hidden
variables $\{x_i\}_i$ were pre-assigned to the vertices.
This occurs when we have a candidate quantity to identify
with the hidden variable. However, we can
reverse the point of view and extend the ML approach so
that, without any prior information, the hidden variables
are included in $\theta$ and treated as free parameters themselves,
to be tuned to their ML values $\{x_i^*\}_i$. In this
way, hidden variables will be no longer 'hidden', since 
they can be extracted from topological data. This is an 
exciting possibility that can be applied to any real net- 
work. Moreover, this extension of the parameter space 
also allows us to match N additional properties besides 
the overall number of links. However, the unbiased choice 
of these properties must be dictated by the ML principle. 

For instance, let us look back at the model defined in
eq. (12), now considering $x_i$ and $x_j$ not as fixed quantities,
but as free parameters exactly as $z$, to be included in $\theta$.
Differentiating $\lambda(\theta) = \lambda(z, x_1, \ldots, x_N)$ with respect to $z$ gives
again eq. (13) with $x_i$ replaced by $x_i^*$, and differentiating with
respect to $x_i$ yields the $N$ additional equations

$$k_i = \sum_{j\neq i} \frac{z^* x_i^* x_j^*}{1 + z^* x_i^* x_j^*}, \qquad i = 1, \ldots, N \qquad (17)$$



Therefore we find that the $N$ correct reference properties
for this model are the degrees: $\langle k_i\rangle_{\theta^*} = \sum_{j\neq i} p_{ij}(\theta^*) =
k_i$. This is not true in general: the model (9) would imply
different reference properties such that $\langle k_i\rangle \neq k_i$, so that
choosing the degrees as the properties to match would
bias the parameter choice. Again, this difference arises
because eq. (17) corresponds to eq. (16) for the exponential
model $H(G|\theta) = \sum_i \theta_i k_i$ [11], while the model in
eq. (9) cannot be put in an exponential form. We stress
that, although eq. (17) is formally identical to the familiar
expression yielding $\langle k_i\rangle$ as a function of $\{x_i\}_i$ if the latter
are fixed [11], its meaning here is completely reversed:
the degrees $k_i$ are fixed by observation and the unknown
hidden variables are inferred from them through the ML
condition. This is our key result. Note that, although
determining the $x_i^*$'s requires solving the $N + 1$ coupled
equations (13) and (17), the number of independent expressions
is much smaller since: i) eqs. (17) automatically
imply eq. (13), so we can reabsorb $z^*$ in a redefinition
of $x_i^*$ and discard eq. (13); ii) all vertices with the same
degree $k$ obey equivalent equations and hence are associated
with the same value $x_k^*$. So eqs. (17) reduce to

$$k = \sum_{k'} P(k')\,\frac{x_k^* x_{k'}^*}{1 + x_k^* x_{k'}^*} - \frac{(x_k^*)^2}{1 + (x_k^*)^2} \qquad (18)$$



where $P(k)$ is the number of vertices with degree $k$, the
last term removes the self-contribution of a vertex to its
own degree, and $k$ and $k'$ take only their empirical values.
Hence the number of nonequivalent equations equals the
number of distinct degrees that are actually observed,
which is always much less than $N$.

FIG. 1: ML hidden variables ($x_i^*$) versus GDP rescaled to the
mean ($w_i$) for the WTW (year 2000), and linear fit.

We can test our method on the WTW data, since
from the aforementioned previous study we know that
the GDP of each country plays the role of the hidden
variable $x_i$, and that the real WTW is well reproduced
by eq. (12) [6]. We can first use eq. (18) to find the values
$\{x_i^*\}_i$ by exploiting only topological data (the degrees
$\{k_i\}_i$), and then compare these values with the empirical
GDP of each country $i$ (which is independent of topological
data), rescaled to its mean to factor out physical
units. As shown in Fig. 1, the two variables indeed display
a linear trend over several orders of magnitude. Therefore
our method successfully identifies the GDP as the hidden
variable. Clearly, our approach can be used to
uncover hidden factors from other real-world networks,
such as biological and social webs. An example is that of
food web models [19], where it is assumed that predation
probabilities depend on hypothetical niche values $n_i$ associated
with each species. Our formalism allows one to extract
niche values directly from empirical food webs, and not
from ad hoc statistical distributions. Another interesting
application is to gene regulatory networks, where
the lengths of regulatory sequences and promoter regions
have been shown to determine the connection probability
$p_{ij}$ [23]. Similarly, our approach allows one to extract the
vertex-specific quantities (such as expansiveness, attractiveness
or mobility-related parameters) that are commonly
assumed to determine the topology and community
structure of social networks [13, 21]. In all these
cases, the hypotheses can be tested against real data by
plugging any particular form of $p_{ij} = p(x_i, x_j)$ into eq. (8)
and looking for the values $\{x_i^*\}_i$ that solve eq. (4), i.e.

$$\sum_{j\neq i} \frac{a_{ij} - p(x_i^*, x_j^*)}{p(x_i^*, x_j^*)\left[1 - p(x_i^*, x_j^*)\right]}\left[\frac{\partial p(x_i, x_j)}{\partial x_i}\right]_{x = x^*} = 0 \quad \forall i \qquad (19)$$

Note that for eq. (12) one correctly recovers eq. (17). Once
obtained, the values $\{x_i^*\}_i$ can be compared with the (totally
independent) empirical ones to check for significant
correlations, as we have done for the GDP data. Clearly,
an important open problem to address in the future is
understanding the conditions under which eq. (19), and
similarly eq. (18) for a generic $P(k)$, can be solved.
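
For small networks, eq. (17) can be solved numerically in a few lines. The sketch below absorbs $z^*$ into the hidden variables, writes $x_i = e^{-\theta_i}$, and climbs the log-likelihood by plain gradient ascent, using the fact that $\partial\lambda/\partial\theta_i = \langle k_i\rangle - k_i$ for this model. Gradient ascent, the step size, the number of steps and the degree sequence are assumptions of this illustration, not the procedure used for the WTW analysis; the degree sequence was chosen so that no vertex has degree $0$ or $N-1$, for which the solution would not be finite.

```python
import math


def connection_probability(theta_i, theta_j):
    """p_ij = x_i x_j / (1 + x_i x_j) with x_i = exp(-theta_i)."""
    return 1.0 / (1.0 + math.exp(theta_i + theta_j))


def expected_degrees(theta):
    """<k_i> = sum over j != i of p_ij, the right-hand side of eq. (17)."""
    n = len(theta)
    return [
        sum(connection_probability(theta[i], theta[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def infer_hidden_variables(degrees, n_steps=5000):
    """Gradient ascent on the log-likelihood; returns the values x_i*."""
    n = len(degrees)
    step = 1.0 / (n - 1)  # conservative step size (an assumption of this sketch)
    theta = [0.0] * n
    for _ in range(n_steps):
        expected = expected_degrees(theta)
        # d(lambda)/d(theta_i) = <k_i> - k_i
        theta = [t + step * (e - k) for t, e, k in zip(theta, expected, degrees)]
    return [math.exp(-t) for t in theta]


# Hypothetical observed degree sequence of a 6-vertex network.
degrees = [3, 2, 2, 2, 2, 1]
x_star = infer_hidden_variables(degrees)
print(x_star)
```

Vertices with equal degree receive equal values of $x^*$, consistent with the reduction from eq. (17) to eq. (18), and a larger degree corresponds to a larger inferred hidden variable.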

We have shown that the ML principle indicates the 
statistically correct parameter values of network models, 
making the choice of reference properties no longer arbitrary.
It also identifies undesired biases in graph models,
and allows one to overcome them constructively. Most importantly,
it provides an elegant way to extract information
from a network by uncovering the underlying hidden variables.
This possibility, which we have empirically tested
in the case of the World Trade Web, opens up a variety
of applications in economics, biology, and social science.

After submission of this article, we became aware of later
studies based on a similar idea.